Fast Statistical Parsing of Noun Phrases for Document Indexing

Author

  • ChengXiang Zhai
Abstract

Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply not efficient enough, and not robust enough, to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing, and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experimental results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.

Similar resources

Evaluation of Syntactic Phrase Indexing -- CLARIT NLP Track Report

The CLARIT NLP track effort is focused on evaluating the usefulness of syntactic phrases for document indexing. The CLARIT system has several NLP techniques integrated with the vector space retrieval model [Evans et al. 91, Evans et al. 95]. The NLP techniques used in CLARIT include morphological analysis, robust noun-phrase parsing, and automatic construction of first-order thesauri, among others...

Recognising Complex Prepositions Prep+N+Prep as Negative Patterns in Automatic Term Extraction from Texts

This work is a study of the delimitation of complex prepositions (CP) as lexical units, items of a computational lexicon that includes compounds and phrases. In addition, given the utmost importance of spotting noun phrases (NP) in document retrieval systems, parsing prepositional structures such as “Prep1 N Prep2 X” prevents the fragment “N Prep2 X” from being detected as a noun phrase, i.e. t...

An Approach to Automatic Identification of Chinese Base Noun Phrases

This paper presents an approach to identifying Chinese base noun phrases. The method is based on the GLR algorithm and extends the GLR parsing algorithm further. It is a mixed approach that combines rule-based and statistical methods by using a PCFG system. The experimental results indicate that the method is not only simple but also feasible and efficient for base noun phrase identification.

Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and its Automatic Evaluation

phrases. The partial parser is motivated by an intuition (Abney, 1991): To acquire noun phrases from running texts is useful for many applications, such as word grouping, terminology indexing, etc. The reported literature adopts either a purely probabilistic approach or a purely rule-based noun phrase grammar to tackle this problem. In this paper, we apply a probabilistic chunker to deciding the implicit bo...

A Dempster-Shafer Model for Document Retrieval using Noun Phrases

In this paper, we propose a document retrieval system based on natural language processing of documents and queries. We use single terms and term groups as indexing elements to represent documents and queries. The model is formally expressed within the Dempster-Shafer Theory of Evidence. We discuss in detail how we use this theory to represent a document collection, indexing elements, documents...

Publication date: 1997